
    Unsupervised naming of persons in TV broadcasts: using written names, pronounced names, or both? [Nommage non supervisé des personnes dans les émissions de télévision. Utilisation des noms écrits, des noms prononcés ou des deux ?]

    National audience. Person identification in TV broadcasts is a valuable tool for indexing this type of video, but the use of biometric models is not a viable option without a priori knowledge of the people present in the videos. Pronounced or written names can provide a list of hypothesis names. We compare the potential of these two modalities (pronounced or written names) for extracting the names of the people who speak and/or appear on screen. Pronounced names offer a larger number of citation occurrences, but transcription and detection errors for these names halve the potential of this modality. Written names benefit from the steadily improving quality of the videos and are more easily detected. Moreover, affiliating written names with speakers/faces remains simpler than for pronounced names.

    Unsupervised Speaker Identification in TV Broadcast Based on Written Names

    International audience. Identifying speakers in TV broadcast in an unsupervised way (i.e. without biometric models) is a solution for avoiding costly annotations. Existing methods usually use pronounced names as a source of names for identifying the speech clusters provided by a diarization step, but this source is too imprecise to provide sufficient confidence. To overcome this issue, another source of names can be used: the names written in title blocks in the image track. We first compare these two sources of names on their ability to provide the names of the speakers in TV broadcast. This study shows that written names are the more interesting source, thanks to their high precision for identifying the current speaker. We also propose two approaches for finding speaker identities based only on the names written in the image track. With the "late naming" approach, we propose different ways of propagating written names onto clusters. Our second proposition, "early naming", modifies the speaker diarization module (agglomerative clustering) by adding constraints that prevent two clusters with different associated written names from being merged. These methods were tested on phase 1 of the REPERE corpus, containing 3 hours of annotated videos. Our best "late naming" system reaches an F-measure of 73.1%, and "early naming" improves over this result both in terms of identification error rate and in terms of stability of the clustering stopping criterion. By comparison, a mono-modal, supervised speaker identification system with 535 speaker models trained on matching development data and additional TV and radio data only reaches an F-measure of 57.2%.
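
    To make the "early naming" idea above concrete, the sketch below shows agglomerative clustering with a cannot-link constraint on written names: two clusters already associated with different names are never merged. The cluster representation, the distance function and the stopping threshold are placeholders for illustration, not the actual components of the system described in the abstract.

        # Hypothetical sketch of "early naming": agglomerative clustering of speech
        # turns in which clusters associated with different written names may never
        # be merged. Distances, features and the stopping threshold are placeholders.

        def constrained_clustering(turns, distance, threshold):
            """turns: list of (turn_id, written_name_or_None);
            distance(a, b) -> float between two clusters (lists of turn_ids)."""
            # Each cluster keeps its member turns and the written names seen so far.
            clusters = [{"turns": [t], "names": {n} if n else set()} for t, n in turns]

            while len(clusters) > 1:
                best = None
                for i in range(len(clusters)):
                    for j in range(i + 1, len(clusters)):
                        a, b = clusters[i], clusters[j]
                        # Constraint: never merge clusters labelled with different names.
                        if a["names"] and b["names"] and a["names"] != b["names"]:
                            continue
                        d = distance(a["turns"], b["turns"])
                        if best is None or d < best[0]:
                            best = (d, i, j)
                if best is None or best[0] > threshold:
                    break  # no allowed merge left, or closest pair too distant
                _, i, j = best
                clusters[i]["turns"] += clusters[j]["turns"]
                clusters[i]["names"] |= clusters[j]["names"]
                del clusters[j]
            return clusters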

    Unsupervised naming of persons in TV broadcasts: a review of the potential of each modality [Nommage non-supervisé des personnes dans les émissions de télévision: une revue du potentiel de chaque modalité]

    National audience. Person identification in TV broadcasts is a valuable tool for indexing these videos, but the use of biometric models is not a sustainable option without a priori knowledge of the people present in the videos. Names pronounced in the audio or written on screen can provide a list of hypothesis names. We compare the potential of these two modalities (pronounced or written names) for extracting the true names of the speakers and/or faces. Pronounced names offer many citation instances, but transcription and detection errors for these names halve the potential of this modality. Written names benefit from improving video quality and are easy to detect. Affiliating written names with speakers/faces is also simpler than for pronounced names.

    Active Selection with Label Propagation for Minimizing Human Effort in Speaker Annotation of TV Shows

    International audience. In this paper, an approach minimizing human involvement in the manual annotation of speakers is presented. At each iteration, a selection strategy chooses the most suitable speech track for manual annotation; the resulting label is then associated with all the tracks in the cluster that contains it. The study makes use of a system that propagates speaker track labels, based on agglomerative clustering with constraints. Several different unsupervised active-learning selection strategies are evaluated. Additionally, the presented approach can be used to efficiently generate sets of speech tracks for training biometric models; in this case, both the length of the speech tracks available for a given person and their purity are taken into consideration. The system is evaluated on the REPERE video corpus. Along with the speech tracks extracted from the videos, an optical character recognition system was adapted to extract the names of potential speakers, which were used as the 'cold start' for the selection method.
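
    As an illustration of the selection-and-propagation loop described above, the sketch below picks one speech track per iteration, asks for its label and propagates that label to the whole cluster. The particular strategy shown (longest track of a still-unlabelled cluster) and the track fields are assumptions, not the strategies evaluated in the paper.

        # Hypothetical sketch of the active-selection loop: at each iteration one
        # speech track is chosen for manual annotation and its label is propagated
        # to every track of the cluster containing it.

        def annotate_actively(clusters, ask_human, n_iterations):
            """clusters: list of lists of tracks; each track is a dict with
            'track_id' and 'duration'. ask_human(track) -> person name."""
            labels = {}  # track_id -> manually assigned or propagated label
            for _ in range(n_iterations):
                # Selection strategy (one of many possible): the cluster holding the
                # longest still-unlabelled track, then its longest track.
                candidates = [c for c in clusters
                              if not any(t["track_id"] in labels for t in c)]
                if not candidates:
                    break
                cluster = max(candidates, key=lambda c: max(t["duration"] for t in c))
                track = max(cluster, key=lambda t: t["duration"])
                name = ask_human(track)            # one manual annotation per iteration
                for t in cluster:                  # propagate to the whole cluster
                    labels[t["track_id"]] = name
            return labels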

    Automatic propagation of manual annotations for multimodal person identification in TV shows

    International audience. In this paper, an approach to propagating human annotations for person identification in a multimodal context is proposed. A system combining speaker diarization and face clustering is used to produce multimodal clusters. Whole multimodal clusters, rather than single tracks, are then annotated, with labels spread by propagation. An optical character recognition system provides the initial annotation. Four different strategies for selecting annotation candidates are tested. The initial results of annotation propagation are promising: with a proper active-learning selection strategy, the human annotator's involvement could be reduced even further.
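
    A rough sketch of how speaker and face tracks could be grouped into the multimodal clusters mentioned above is given below; the attachment rule (maximum temporal overlap with a speaker cluster) and the data layout are illustrative assumptions, not the paper's actual clustering.

        # Hypothetical sketch: a face track is attached to the speaker cluster with
        # which it overlaps most in time, producing simple multimodal clusters.

        def build_multimodal_clusters(speaker_clusters, face_tracks, min_overlap=1.0):
            """speaker_clusters: {cluster_id: [(start, end), ...]} speech turns;
            face_tracks: [(face_id, start, end)].
            Returns {cluster_id: [face_id, ...]}."""
            def overlap(a, b):
                return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

            multimodal = {cid: [] for cid in speaker_clusters}
            for face_id, start, end in face_tracks:
                scores = {cid: sum(overlap((start, end), turn) for turn in turns)
                          for cid, turns in speaker_clusters.items()}
                best = max(scores, key=scores.get, default=None)
                if best is not None and scores[best] >= min_overlap:
                    multimodal[best].append(face_id)
            return multimodal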

    Towards a better integration of written names for unsupervised speakers identification in videos

    International audience. Existing methods for unsupervised identification of speakers in TV broadcast usually rely on the output of a speaker diarization module and try to name each cluster using names provided by another source of information; we call this "late naming". Written names extracted from title blocks tend to yield high-precision identification, but they cannot correct errors made during the clustering step. In this paper, we extend our previous "late naming" approach in two ways: "integrated naming" and "early naming". While "late naming" relies on a speaker diarization module optimized for diarization alone, "integrated naming" jointly optimizes speaker diarization and name propagation in terms of identification errors. "Early naming" modifies the speaker diarization module by adding constraints that prevent two clusters with different written names from being merged. While "integrated naming" yields identification performance similar to "late naming" (with better precision), "early naming" improves over this baseline both in terms of identification error rate and in terms of stability of the clustering stopping criterion.
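
    The sketch below illustrates one plausible form of "late naming": after diarization, each speaker cluster receives the written name whose on-screen display co-occurs longest with the cluster's speech turns. The co-occurrence criterion is an assumption used for illustration; the papers explore several propagation variants.

        # Hypothetical sketch of "late naming": name each diarization cluster with
        # the written name that co-occurs with it for the longest total duration.

        def late_naming(clusters, written_names):
            """clusters: {cluster_id: [(start, end), ...]} speech turns;
            written_names: [(name, start, end)] title-block displays.
            Returns {cluster_id: name or None}."""
            def overlap(a, b):
                return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

            naming = {}
            for cid, turns in clusters.items():
                scores = {}
                for name, start, end in written_names:
                    cooc = sum(overlap(turn, (start, end)) for turn in turns)
                    if cooc > 0:
                        scores[name] = scores.get(name, 0.0) + cooc
                naming[cid] = max(scores, key=scores.get) if scores else None
            return naming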

    Unsupervised naming of speakers in broadcast TV: using written names, pronounced names or both?

    International audience. Person identification in TV broadcast videos is a valuable tool for indexing them. However, the use of biometric models is not a very sustainable option without a priori knowledge of the people present in the videos. Pronounced names (PN) and written names (WN) on the screen can provide hypothesis names for the speakers. We propose an experimental comparison of the potential of these two modalities (pronounced or written names) for extracting the true names of the speakers. Pronounced names offer many citation instances, but transcription and named-entity detection errors halve the potential of this modality. On the contrary, written-name detection benefits from improved video quality and is nowadays robust and efficient enough to name speakers. Oracle experiments on the mapping between written names and speakers also show the complementarity of the PN and WN modalities.
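
    The sketch below shows one possible oracle measure of a modality's "potential": the share of speech time for which the true speaker's name is hypothesised by that modality (WN or PN). This is an illustrative definition under that assumption, not the exact protocol of the paper.

        # Hypothetical oracle measure: fraction of speech time whose true speaker's
        # name is available from a given name source.

        def oracle_potential(speech_turns, available_names):
            """speech_turns: [(true_speaker_name, start, end)];
            available_names: set of names hypothesised by one modality."""
            total = sum(end - start for _, start, end in speech_turns)
            covered = sum(end - start for name, start, end in speech_turns
                          if name in available_names)
            return covered / total if total else 0.0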

    The CAMOMILE collaborative annotation platform for multi-modal, multi-lingual and multi-media documents

    In this paper, we describe the organization and the implementation of the CAMOMILE collaborative annotation framework for multimodal, multimedia, multilingual (3M) data. Given the versatile nature of the analyses that can be performed on 3M data, the structure of the server was kept intentionally simple in order to preserve its genericity, relying on standard Web technologies. Layers of annotations, defined as data associated with a media fragment from the corpus, are stored in a database and can be managed through standard interfaces with authentication. Interfaces tailored to the task at hand can then be developed in an agile way, relying on simple but reliable services for the management of the centralized annotations. We then present our implementation of an active learning scenario for person annotation in video, relying on the CAMOMILE server; during a dry-run experiment, the manual annotation of 716 speech segments was propagated to 3504 labeled tracks. The code of the CAMOMILE framework is distributed as open source.
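
    The sketch below illustrates the data model suggested by this description: an annotation is a piece of data attached to a fragment of a medium, and a layer groups annotations of the same kind. The classes and field names are illustrative assumptions, not the actual CAMOMILE schema or API.

        # Minimal illustrative data model: layers of annotations, each annotation
        # attached to a media fragment. Not the real CAMOMILE schema.

        from dataclasses import dataclass, field
        from typing import Any, List


        @dataclass
        class Annotation:
            medium_id: str   # which video/audio document the fragment belongs to
            fragment: tuple  # e.g. (start_time, end_time) in seconds
            data: Any        # task-specific payload, e.g. a speaker name


        @dataclass
        class Layer:
            corpus_id: str
            name: str        # e.g. "speaker identity" or "written names"
            annotations: List[Annotation] = field(default_factory=list)

            def add(self, medium_id, fragment, data):
                self.annotations.append(Annotation(medium_id, fragment, data))


        # Example: a manual speaker label on a 4-second speech segment.
        layer = Layer("REPERE", "speaker identity")
        layer.add("video_001", (12.0, 16.0), {"person": "firstname_lastname"})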
